<p>
  Implement a CUDA program to calculate the categorical cross-entropy loss for a batch of predictions.
  Given a matrix of predicted logits \(Z\) of size \(N \times C\) and a vector of true class labels <code>true_labels</code> of size \(N\), where each label is a zero-based class index, compute the average cross-entropy loss over the batch.
  The loss for a single sample \(j\) with logits \(z_j = [z_{j,0}, \ldots, z_{j,C-1}]\) and true label \(y_j\) is calculated using the numerically stable formula:
  \[ \text{Loss}_j = \log\left(\sum_{k=0}^{C-1} e^{z_{jk}}\right) - z_{j, y_j} \]
  The final output stored in the <code>loss</code> variable should be the average loss over the \(N\) samples:
  \[ L = \frac{1}{N} \sum_{j=1}^{N} \text{Loss}_j \]
  The input parameters are <code>logits</code>, <code>true_labels</code>, <code>N</code> (number of samples), and <code>C</code> (number of classes). The result should be stored in <code>loss</code> (a pointer to a single float).
</p>
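<p>
  The formulas above can be sketched as a host-side reference that is handy for checking a kernel's output. This is only an illustration, not part of the required interface: the name <code>cross_entropy_reference</code> is hypothetical, the logits are assumed to be stored row-major (sample <code>j</code>, class <code>k</code> at index <code>j * C + k</code>), and labels are assumed to be zero-based. Subtracting the row maximum before exponentiating is an extra precaution against overflow; it does not change the value of the loss.
</p>

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical CPU reference; the name and the row-major layout are assumptions.
// Returns the average cross-entropy loss over N samples with C classes each.
float cross_entropy_reference(const std::vector<float>& logits,
                              const std::vector<int>& true_labels,
                              int N, int C) {
    assert(static_cast<int>(logits.size()) == N * C);
    assert(static_cast<int>(true_labels.size()) == N);

    double total = 0.0;  // accumulate in double to limit rounding error
    for (int j = 0; j < N; ++j) {
        const float* row = logits.data() + j * C;
        const int label = true_labels[j];
        assert(label >= 0 && label < C);  // labels are zero-based indices

        // Row maximum, used to shift the exponents so they are all <= 0.
        float row_max = row[0];
        for (int k = 1; k < C; ++k) {
            row_max = std::fmax(row_max, row[k]);
        }

        // Sum of exp(z_jk - max) over all classes.
        double sum_exp = 0.0;
        for (int k = 0; k < C; ++k) {
            sum_exp += std::exp(static_cast<double>(row[k]) - row_max);
        }

        // log(sum_k exp(z_jk)) - z_{j, y_j}, with the shift added back.
        total += std::log(sum_exp) + row_max - row[label];
    }
    return static_cast<float>(total / N);
}
```

<p>
  A GPU version would typically parallelize the per-sample work and reduce the per-sample losses into <code>loss</code>; the reference above only pins down the arithmetic.
</p>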

<h2>Implementation Requirements</h2>
<ul>
  <li>External libraries are not permitted</li>
  <li>The <code>solve</code> function signature must remain unchanged</li>
  <li>The final result (average loss) must be stored in <code>loss</code></li>
</ul>

<h2>Example 1:</h2>
<pre>Input:  N = 2, C = 3
        logits = [[1.0, 2.0, 0.5], [0.1, 3.0, 1.5]]
        true_labels = [1, 1]
Output: loss = [0.3548926]</pre>
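<p>
  As a check on the arithmetic (values rounded to five decimals): both labels are 1, which, read as a zero-based index, selects the logits 2.0 and 3.0 respectively.
  \[ \text{first sample: } \log\left(e^{1.0} + e^{2.0} + e^{0.5}\right) - 2.0 \approx 2.46437 - 2.0 = 0.46437 \]
  \[ \text{second sample: } \log\left(e^{0.1} + e^{3.0} + e^{1.5}\right) - 3.0 \approx 3.24542 - 3.0 = 0.24542 \]
  \[ L \approx \frac{0.46437 + 0.24542}{2} \approx 0.35489 \]
</p>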


<h2>Example 2:</h2>
<pre>Input:  N = 3, C = 4
        logits = [[-0.5, 1.5, 0.0, 1.0], [2.0, -1.0, 0.5, 0.5], [0.0, 0.0, 0.0, 0.0]]
        true_labels = [3, 0, 1]
Output: loss = [0.98820376]</pre>

<h2>Constraints</h2>
<ul>
  <li>1 &le; <code>N</code> &le; 10,000</li>
  <li>2 &le; <code>C</code> &le; 1,000</li>
  <li>-10.0 &le; <code>logits[i, j]</code> &le; 10.0</li>
  <li>0 &le; <code>true_labels[i]</code> &lt; <code>C</code></li>
</ul>